Abstract

The realization that string theory gives rise to a huge landscape of vacuum 
solutions has recently prompted a statistical approach towards extracting phe- 
nomenological predictions from string theory. Unfortunately, for most classes 
of string models, direct enumeration of all solutions is not computationally 
feasible and thus statistical studies must resort to other methods in order to 
extract meaningful information. In this paper, we discuss some of the issues 
that arise when attempting to extract statistical correlations from a large data 
set to which our computational access is necessarily limited. Our main focus 
is the problem of "floating correlations". As we discuss, this problem is en- 
demic to investigations of this type and reflects the fact that not all physically 
distinct string models are equally likely to be sampled in any random search 
through the landscape, thereby causing statistical correlations to "float" as a 
function of sample size. We propose several possible methods that can be used 
to overcome this problem, and we show through explicit examples that these 
methods lead to correlations and statistical distributions which are not only 
stable as a function of sample size, but which differ significantly from those 
which would have been naively apparent from only a partial data set. 
1 Introduction



Over the past few years, it has become increasingly clear that string theory gives 
rise to a very large number of vacuum solutions PQ. Because the specific low-energy 
phenomenology that can be expected to emerge from the string depends critically 
on the particular choice of vacuum state, detailed quantities such as particle masses 
and mixings — and even more general quantities and structures such as the choice of 
gauge group, number of chiral particle generations, magnitude of the supersymmetry- 
breaking scale, and the cosmological constant — can be expected to vary significantly 
from one vacuum solution to the next. Thus, in the absence of some sort of vacuum 
selection principle, it has been proposed that meaningful phenomenological predic- 
tions from string theory might instead be extracted statistically, through the discov- 
ery of statistical correlations across the huge "landscape" of string vacua j^j- Such 
string-derived correlations would relate different phenomenological features that are 
apparently unrelated in field theory, and would thus represent string-theoretic pre- 
dictions that hold for the majority of string vacua. 

Unfortunately, the space of possible vacua is extremely large, with some estimates 
putting the number of phenomenologically interesting vacua at $10^{500}$ or more [2].
Direct computational access to this large data set is therefore virtually impossible, 
and one is forced to undertake statistical studies of a more limited nature. 

To date, there has been considerable work in this direction [^UHlSElinilZllEllH]; 
for reviews, see Refs. [TUl HI] . Collectively, this work has focused on different classes 
of string models, both closed and open, employing a number of different underlying 
string constructions and formulations. However, regardless of the particular string 
model or construction procedure utilized, any such statistical analysis can be char- 
acterized as belonging to one of three different classes: 

• Abstract studies: First, there are abstract mathematical studies that proceed 
directly from the construction formalisms (e.g., considerations of flux combi- 
nations). Although large sets of specific string models are not enumerated or 
analyzed, general expectations and trends are deduced based on the statistical 
properties of the parameters that are relevant in these constructions. 

• Direct enumeration studies: Second, there are statistical studies based on 
direct enumeration of finite subclasses of string models. Within these well- 
defined subclasses, one enumerates literally all possible solutions and thereby 
collects statistics across a large but finite tractable data set. 

• Random search studies: Finally, there are statistical studies that aim to 
explore a data set which is (either effectively or literally) infinite in size. Such 
studies involve randomly generating a large but finite sample of actual string 
models and then analyzing the statistical properties of the sample, assuming 
the sample to be representative of the class of models under examination as a 
whole. 

Indeed, all three types of studies have been undertaken in the literature.

Certain difficulties are inherent to all of these approaches. For example, in each 
case there is the over-arching problem of defining a measure in the space of string 
solutions. We shall discuss this problem briefly below, but this is not the chief concern 
of the present paper and for simplicity we shall simply assume that each physically 
distinct string model is to be weighted equally in any averaging process. 

By contrast, other difficulties are specifically tied to individual approaches. For 
example, the first approach has great mathematical generality but often lacks the 
precision and power that can come from direct enumerations of actual string models. 
Likewise, the second approach is fundamentally limited to classes of string models 
for which a full enumeration is possible — i.e., string constructions which admit a 
number of solutions which is both finite and accessible with current computational 
power. 

For these reasons, the third approach might ultimately seem to have the best 
prospects for generating precise statistical statements about a relatively large string 
landscape. As has been recently shown, the problem of directly enumerating certain 
classes of string models is actually NP-complete [3]. This fact implies that our com- 
putational access to the string landscape will always be quite limited, which in turn 
suggests that random search studies may be more efficient for exploring the string 
landscape. Indeed, most large-scale census studies are of this type. 

Although significant effort has been devoted to studying the algorithms and issues 
facing direct enumeration studies, relatively little effort has been invested in studying 
the issues facing random search studies. In this paper, we shall present some elemen- 
tary observations concerning some of the potential pitfalls of such studies, and the 
methods by which they can be overcome. 

Clearly, one fundamental difficulty is that one must assume that the sample set 
of string vacua is representative of the relevant class of string vacua as a whole. To 
attempt to ensure this, one typically generates these sample sets as randomly as 
possible from amongst the functionally infinite set of allowed solutions in the class. 
One therefore assumes that no bias has been introduced into this procedure. However,
as we shall discuss in this paper, there is a further, more subtle kind of bias which is
nearly inevitable in random searches through the string landscape. Moreover, as we
shall explain, this bias leads directly to the problem of "floating correlations". This
in turn leads to tremendous distortions in the statistical correlations that one would 
appear to extract through such studies. 

In this paper, we shall begin by discussing the origins of this phenomenon. We 
shall then discuss various means by which it may be overcome. Finally, we shall 
present an explicit example drawn from studies of the heterotic landscape which illus- 
trates that these issues, and their resolutions, can dramatically alter the magnitudes 
of the correlations that one would naively appear to extract from the landscape. 
2 The problem of floating correlations

In general, there are many different construction formalisms which may be em- 
ployed in order to build self-consistent string models. For example, closed string mod- 
els may be constructed through orbifold techniques (with or without Wilson lines), 
or alternatively using geometric techniques (e.g., by specifying particular Calabi-Yau 
compactifications). There are also generalized conformal field theory techniques (such 
as those utilized in Gepner constructions), or special cases of these which involve
only free worldsheet bosons or fermions with different boundary-condition phases. 
Similar choices exist as well for open strings, where one can have, e.g., intersecting 
D-branes, fluxes, and so forth ^2]- Not all construction formalisms are distinct, and 
the sets of models which can be realized through each construction technique can 
often have significant overlaps. 

Within each construction formalism, there are certain parameters which one
is free to choose; we shall collectively label these internal parameters $\{x_i\}$. These may
be compactification moduli, boundary-condition phases, Wilson-line coefficients, or 
topological quantities specifying Calabi-Yau manifolds; likewise they might be D- 
brane dimensionalities and charges, wrapping numbers, or intersection angles. We 
may also include among the set $\{x_i\}$ the vevs of moduli fields and/or fluxes which
are necessary for guaranteeing stable (or at least sufficiently flat) vacuum solutions. 
As long as these internal parameters are chosen to satisfy whatever self-consistency 
constraints are inherent to the relevant construction method (such as those stemming 
from conformal invariance and modular invariance in the case of closed strings, or 
anomaly and tadpole cancellations in the case of open strings), one is guaranteed to 
have constructed a bona-fide string model. 

However, regardless of the particular construction formalism employed, one can- 
not generally define a given string model as being distinct from all others on the basis 
of an examination of these parameters $\{x_i\}$. Rather, one must determine the spacetime
properties of the resulting string model in order to decide whether this model is
truly distinct from another. Such spacetime properties might include,
for example, the gauge group, the number of spacetime supersymmetries, the entire 
particle spectrum, and the associated couplings. Collectively, we can describe these 
spacetime properties as belonging to a set of spacetime parameters $\{y_j\}$. If any of
these $y$-parameters are different for two candidate models, we say that the two
candidate models are truly distinct, i.e., that we truly have two models. Of course,
the parameters $\{y_j\}$ are not independent of each other (as they might be in field
theory), but are presumably correlated by the fact that they emerge from a given 
self-consistent string model. These are the types of correlations that one ultimately 
hopes to extract as string predictions from the landscape. 

In general, each construction technique provides a recipe or prescription for
starting with a self-consistent set of parameters $\{x_i\}$ and generating a corresponding set
of spacetime parameters $\{y_j\}$. In other words, each construction formalism implicitly
provides us with a set of functions $f_j$ such that

$$ y_j = f_j(\{x_i\}) . \tag{2.1} $$

However, deriving the exact explicit form of such functions is a formidable task, and 
it is not always possible to extract these functions explicitly from the underlying 
construction method. What is important for our purposes, however, is that such 
functions represent the dependence of the spacetime y-parameters on the internal 
x-parameters. 

Although not much is generally known about such functions $f_j$, one thing is clear:
these functions are not one-to-one. Rather, there exist numerous redundancies
according to which different combinations of $\{x_i\}$ can lead to exactly the same $\{y_j\}$.
In general, such redundancies exist because of a variety of factors. Sometimes, there 
are underlying identifiable worldsheet symmetries (often of a geometric nature, e.g., 
mirror symmetries) which cause two different constructions to lead to the same phys- 
ical string model. In such cases, these redundancies are well-understood and can 
perhaps be quantified and eliminated from the model-construction procedure, but 
this process becomes intractable and inefficient for sufficiently complicated
models. In other cases, however, there may simply be redundancies in the chosen 
construction formalism such that different combinations of parameters can result in 
the same physical string model in spacetime. For example, it often happens that two 
unrelated sets of orbifold twists and Wilson lines can result in the same string model 
even when there is no apparent geometric connection between them. Regardless of 
the cause, however, the important point is that the mapping between the internal 
$x$-parameters and the spacetime $y$-parameters is not one-to-one. We are therefore
faced with the situation sketched in Fig. 1.

This feature can have devastating consequences for a random search through 
the space of string models. Because any such search must be tied to a particular 
construction technique, one cannot simply survey the model space of self-consistent 
$\{y_j\}$; rather, one is forced to survey the parameter space $\{x_i\}$. This means that we
do not have direct access to the model space in which each model is weighted equally;
rather, we only have access to a deformation of this model space in which models with
multiple $x$-representations occupy a larger effective volume than those with fewer
$x$-representations. We may refer to this deformed model space as a probability space,
since each model in the probability space shall be defined to occupy a volume which is
proportional to its probability of being selected through a generation of self-consistent
$x$-parameters. This is illustrated in Fig. 2.

This can lead to three potential types of bias in a random model search. The first 
two are relatively obvious and straightforward to deal with: 

• First, one may not be sampling the parameter space in a truly random way.
Indeed, the selection of $x$-parameters may be skewed as the result of a systematic
algorithmic or computational bias. However, this kind of bias is not the focus
of this paper, and we shall assume that our computational algorithms provide
a truly random sampling of model-construction parameters. (In any case, the
methods we shall eventually be developing in this paper can compensate for a
bias of this type as well.)

Figure 1: Each string model-construction technique provides a mapping between a space of
internal parameters (such as compactification moduli) and a physical model in spacetime.
However, this mapping is not one-to-one, and there generally exists a huge redundancy
wherein a single physical string model in spacetime (such as Model A in the figure) can have
multiple redundant realizations in terms of internal parameters. For this reason, the space
of internal parameters is usually significantly larger than the space of obtainable distinct
models. The shaded region represents models which, though entirely self-consistent, are
not realizable through the construction technique under study.

• Second, one might be oversampling models for which there exist multiple in- 
ternal realizations. For this reason, it is necessary to ensure that one does not 
consider a given string model more than once in the random search process. In 
other words, each time a self-consistent set of x-parameters is generated, one 
must calculate the corresponding y-parameters and verify that these parameters 
do not match those of any other model which has previously been considered in 
the same sample. While conceptually straightforward, this requirement is com- 
putationally and memory intensive since it requires that any search procedure 
maintain a cumulative, readable memory of all models that have already been 
constructed in the sample. Indeed, we have found that this feature alone tends 
to provide the most severe limitations on the sizes of string model samples that 
can feasibly be generated. 
Figure 2: Illustration of the difference between the model space and the probability space,
with total volumes $\Omega_{\rm model}$ and $\Omega_{\rm prob}$ respectively. Each box represents a distinct string
model. In the model space, each model occupies the same volume, whereas in the probability
space, each model occupies a volume which is proportional to its probability of production.
Note that the shaded regions of Fig. 1 have zero probability of being produced and thus do
not appear in the probability space at all.



Thus, while these types of bias are important, both can easily be addressed. 
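
The bookkeeping required by the second item can be sketched in a few lines of Python. This is only an illustrative toy and not the procedure used in any actual landscape study: the function `toy_spacetime_map` is a hypothetical stand-in for the map of Eq. (2.1) from internal parameters to spacetime parameters, and the choice of integer-valued internal parameters is an assumption made purely for brevity.

```python
import random

def toy_spacetime_map(x):
    """Hypothetical stand-in for y_j = f_j({x_i}); deliberately many-to-one."""
    a, b = x
    return ((a * b) % 7, (a + b) % 3)

def random_search(n_draws, seed=0):
    """Draw x-parameters at random and keep each distinct y only once."""
    rng = random.Random(seed)
    seen = set()          # cumulative memory of all models found so far
    n_redundant = 0       # draws that reproduced an already-known model
    for _ in range(n_draws):
        x = (rng.randrange(100), rng.randrange(100))
        y = toy_spacetime_map(x)
        if y in seen:
            n_redundant += 1
        else:
            seen.add(y)
    return seen, n_redundant
```

The set `seen` plays the role of the cumulative, readable memory described above; in a realistic search its size, rather than the cost of a single draw, is what would be expected to limit the sample.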

However, the third type of bias is more subtle and is the focus of this paper. In 
some sense, this problem is the reverse of the second problem itemized above: some 
models are relatively hard to generate in terms of appropriate $\{x_i\}$. Of course, this
would not be an issue if the redundancy indicated in Fig. 1 were relatively evenly
distributed across the model space. Counting each model with a multiple redundancy 
only once would then eliminate all bias. However, it turns out that some models have 
redundancies which are greater than those of other models by many, many orders of 
magnitude. What this means in practice is that when one is randomly sampling the 
parameter space, one easily "discovers" models such as Model A in Fig. 1 while never
finding models such as Model B. Thus, while models such as Model A are likely 
to be included in any random sample of string models, models such as Model B are 
almost certain to be missed. Indeed, in a typical search, we are not likely to have the 
computational power to probe even the full set of highly likely models. Thus we are 
almost certain to under-represent the relatively unlikely models, assuming we find 
such models at all. 

This kind of disparity is of little consequence if all physical properties of interest 
are evenly distributed across the model space. For example, if we are interested 
in knowing what fraction of models have chiral spectra, this kind of disparity will 
be irrelevant if the chirality property is uncorrelated with the redundancy property. 
However, it is usually the case that the very same underlying features which create the 
hierarchy of redundancies for different string models also lead to uneven distributions 
with respect to their physical properties. For example, a given string construction 
method may easily yield a set of models with a given property (e.g., a gauge group 
of a given large, fixed rank), and yet be capable of yielding models that do not 
have that property (e.g., which exhibit rank-cutting) in some carefully fine-tuned 
circumstances. Thus, if we are generating a sample set of models, we are likely to 
miss those "rare" models until our sample size becomes extremely large. 

The implications of this can be rather severe. If the physical property about which
we are seeking statistical information happens to correlate with this redundancy, then
our statistical correlations or percentages will necessarily evolve (or "float") as a
function of the sample size. Even worse, because we cannot hope to approach a complete
saturation of the model space and because we have little guidance as to the sizes 
or patterns of redundancy in our model-construction procedures, we cannot obtain 
meaningful statistics by generating more models and waiting for this floating process 
to become stable. We emphasize that this is a problem that must be faced regard- 
less of our choice of model-construction technique and regardless of how carefully we 
construct randomized algorithms for model generation. 

Thus, we may summarize this problem as follows. Each time we construct a
self-consistent set of $x$-variables $\{x_i\}$, we examine the corresponding $y$-variables to see
if we have really constructed a new model that we have not seen before. If so, we
add it to our sample set of models; if not, we disregard the set $\{x_i\}$ and generate
another. Very soon, we reach a stage at which models with some physical properties
are "common", and models with other physical properties are "rare". However, it is a
priori impossible to determine what percentages of models might be "common" and
what percentages are "rare" on the basis of this sample set. The problem is that if
we keep generating new candidate sets $\{x_i\}$, we will tend not to generate any further
models of the "common" variety because they will have already been generated.
In other words, each additional distinct model that we generate has an increasing
probability of being "rare", which is why it is distinct from those that have already
been constructed. Thus, rare properties tend to become less rare as the sample size 
increases, which causes our statistical correlations to float as functions of the sample 
size. Indeed, in most realistic situations, this problem can be further compounded 
by the fact that physically interesting properties such as spacetime SUSY, gauge 
groups, numbers of chiral generations, and so forth may be differently distributed 
across models with varying intrinsic probabilities of being selected. This too causes 
our statistical correlations to float as functions of the sample size. 

This, then, is the problem of floating correlations. What is required is a means of 
overcoming this type of bias and extracting statistical information, however limited, 
from such a model search. 
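
The floating behavior summarized above can be reproduced in a small numerical experiment. The following Python sketch is a toy under stated assumptions and is not taken from any actual string construction: the two populations ("common" and "rare"), their sizes, and their relative production probabilities are all hypothetical values chosen only to make the effect visible.

```python
import random

def floating_fraction(checkpoints, n_common=50, n_rare=5000,
                      p_common=1.0, p_rare=0.001, seed=1):
    """Fraction of distinct models found so far that are 'rare', at each checkpoint."""
    rng = random.Random(seed)
    models = list(range(n_common + n_rare))      # ids below n_common are "common"
    weights = [p_common] * n_common + [p_rare] * n_rare
    # Independent weighted draws with replacement, as in a random model search.
    draws = rng.choices(models, weights=weights, k=max(checkpoints))
    seen = set()
    fractions = {}
    for d, model in enumerate(draws, start=1):
        seen.add(model)                          # duplicates are counted only once
        if d in checkpoints:
            n_rare_seen = sum(1 for m in seen if m >= n_common)
            fractions[d] = n_rare_seen / len(seen)
    return fractions
```

Calling `floating_fraction([100, 10000])` returns the observed rare fraction at two sample sizes. With these assumed parameters the true fraction of rare models is 5000/5050, yet the observed fraction starts far below this value and drifts upward as more draws are made, which is the floating behavior described in the text.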

3 Modelling the model search: Drawing balls from an urn 

It will help to develop a mathematical model for the process of randomly exploring 
the model space. Towards this end, let us begin by imagining a big urn filled with 
balls of different colors and compositions. For example, some of the balls are red, 
while others are blue; likewise, some of the balls are plastic, while others are rubber. 
Each ball shall correspond to a distinct string model. Thus, exploring the string 
model space through the random generation of string models becomes analogous to 
the act of drawing a ball from the urn, noting its properties, marking it for future 
identification, replacing the ball in the urn, mixing, and then repeating over and over. 
Of course, since we replace each ball after we have drawn it, each draw is independent. 
If the model-generation method is truly random without any inherent biases, then
each ball will have the same basic probability to be drawn regardless of its properties. 
We shall examine this case in detail first, and then consider more realistic cases where 
the model-generation method is biased. 

Clearly, each draw from the urn need not result in a new ball because there is a 
possibility that we will draw a ball that has already been seen. Thus, after $D$ draws
from the urn, we will have found a number of models $M(D)$ which we expect to be
smaller than $D$. Although $M(D)$ is restricted to be an integer, it is straightforward
to derive an expression for the expectation value $\langle M(D)\rangle$. If we have a total of $N$
different balls in the urn (so that $N$ distinct models are realizable by our construction
method), then the probability of drawing any specific ball is simply $P = 1/N$. Since
we are exploring the model space randomly, the difficulty of finding a new ball will
be related to how fully explored the model space already is. If we have already seen
$x$ distinct balls, then the probability that a new draw will yield a previously unseen
ball is given by

$$ P_{\rm new}(x) = 1 - \frac{x}{N} . \tag{3.1} $$



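Given this, we can determine $\langle M(D)\rangle$ recursively. If $\langle M(D)\rangle$ is already known, then
clearly

$$ \begin{aligned} \langle M(D+1)\rangle &= [M(D)+1]\, P_{\rm new}(M(D)) + M(D)\, [1 - P_{\rm new}(M(D))] \\ &= 1 + \left( 1 - \frac{1}{N} \right) \langle M(D)\rangle . \end{aligned} \tag{3.2} $$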



Here the first term on the first line reflects the contribution from the possibility 
that the next draw yields a new ball hitherto unseen, while the second term reflects 
the possibility that it does not; moreover, in passing to the second line we have
replaced $M(D)$ by $\langle M(D)\rangle$. This recursion relation, along with the initial condition
$\langle M(0)\rangle = M(0) = 0$, allows us to solve for $\langle M(D)\rangle$ exactly:



$$ \langle M(D)\rangle = N \left[ 1 - \left( 1 - \frac{1}{N} \right)^{D} \right] , \tag{3.3} $$

and for $N \gg 1$ we may approximate this as

$$ \langle M(D)\rangle \approx N \left( 1 - e^{-D/N} \right) . \tag{3.4} $$



This has the basic behavior we expect; indeed, calculations along these lines have 
appeared more than a decade ago in Ref. [THj. When the model space is relatively 
unexplored, it is not difficult to find a new model, but as the model space becomes 
more explored it gets harder to find new models. The main feature to note here is 
that the total number N of distinct models appears both as a multiplicative factor 
and in the exponent in this expression. This only happens when all models have an
equal probability to be generated. 
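
As a quick numerical check of the equal-probability case, the following Python sketch compares the closed form $\langle M(D)\rangle = N[1-(1-1/N)^D]$ (the solution of the recursion with $\langle M(0)\rangle = 0$), its large-$N$ approximation $N(1-e^{-D/N})$, and a direct simulation of the urn. The function names are our own illustrative choices, and the simulation assumes a uniform random number generator standing in for an unbiased model-generation method.

```python
import math
import random

def expected_distinct_exact(n_models, n_draws):
    """Closed-form solution of the recursion: N [1 - (1 - 1/N)^D]."""
    return n_models * (1.0 - (1.0 - 1.0 / n_models) ** n_draws)

def expected_distinct_approx(n_models, n_draws):
    """Large-N approximation: N (1 - exp(-D/N))."""
    return n_models * (1.0 - math.exp(-n_draws / n_models))

def simulate_distinct(n_models, n_draws, seed=0):
    """Draw uniformly with replacement and count the distinct balls seen."""
    rng = random.Random(seed)
    return len({rng.randrange(n_models) for _ in range(n_draws)})
```

For $N$ of order a thousand, the exact and approximate forms already agree to well under one model, and a single simulated run scatters around both by an amount consistent with ordinary statistical fluctuations.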

Unfortunately, as we have discussed in Sect. 2, it will typically be the case that 
different models will have different probabilities of being generated. Indeed, as dis- 
cussed in Sect. 2, what we are exploring randomly is typically not the model space, 
but rather the probability space. 

In order to account for this, let us now modify the above analysis by imagining that 
each ball in the urn has a different intrinsic probability of being drawn from the urn, 
and that this probability depends on its composition. For example, we may imagine 
that plastic balls are intrinsically smaller or lighter than rubber balls, and thus have 
a smaller cross section for being selected when we reach into the urn. In general, we 
shall let $p_i$ denote the relative intrinsic probability that a ball in population $i$ will be
selected on a random draw, and we shall let $N_i$ denote the sizes of these populations.
For example, if $N_{\rm plastic}/N_{\rm rubber} = 1/3$ but $p_{\rm plastic}/p_{\rm rubber} = 1/4$, then a plastic ball will
be 1/12 as likely to be drawn from the urn as a rubber ball. We shall assume that
all of the balls with a common composition $i$ share a common intrinsic probability
$p_i$ of being selected, but we shall make no assumption about the number of such
populations. Also note that only ratios of the different $p_i$ shall matter, so there is no
need to normalize the $p_i$ in any particular fashion.*

As illustrated in Fig. 2, each model occupies an equal volume in the model space
but only a rescaled volume in the probability space; the probabilities $p_i$ describe
these rescalings. Indeed, the total volumes of the model and probability spaces can
be defined as

$$ \Omega_{\rm model} \equiv \sum_i N_i , \qquad \Omega_{\rm prob} \equiv \sum_i p_i N_i . \tag{3.5} $$

Of course, with this definition $\Omega_{\rm prob}$ will scale with the overall normalization of the
$p_i$, but this will not be relevant in the following. What is important, however, is that
the probability space will be different from the model space if the $p_i$ are not all
identical. Thus, the volume relations amongst populations with different $p_i$ will be
different in the two spaces. However, by construction, the volume relations amongst
models with the same $p_i$ will be the same in both the model and probability spaces.
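
The two volumes can be made concrete with the plastic/rubber example. The Python sketch below is illustrative only: it takes $\Omega_{\rm model} = \sum_i N_i$ and $\Omega_{\rm prob} = \sum_i p_i N_i$, and it assumes that the chance of drawing from population $i$ is its share $p_i N_i/\Omega_{\rm prob}$ of the probability-space volume. The specific population sizes (100 and 300) are hypothetical values chosen to reproduce the ratios quoted in the text.

```python
from fractions import Fraction

def space_volumes(sizes, intrinsic):
    """Return (Omega_model, Omega_prob, share of probability space per population)."""
    omega_model = sum(sizes.values())
    omega_prob = sum(intrinsic[name] * sizes[name] for name in sizes)
    shares = {name: intrinsic[name] * sizes[name] / omega_prob for name in sizes}
    return omega_model, omega_prob, shares

# Toy urn with N_plastic/N_rubber = 1/3 and p_plastic/p_rubber = 1/4 (assumed sizes).
sizes = {"plastic": 100, "rubber": 300}
intrinsic = {"plastic": Fraction(1, 4), "rubber": Fraction(1)}
omega_model, omega_prob, shares = space_volumes(sizes, intrinsic)

# Relative chance of drawing a plastic ball versus a rubber ball.
ratio = shares["plastic"] / shares["rubber"]
```

Exact rational arithmetic is used so that `ratio` reproduces the factor 1/12 of the text without rounding; rescaling both entries of `intrinsic` by a common factor changes `omega_prob` but leaves `shares` and `ratio` unchanged.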

*We emphasize that in the case of actual string model-building, models do not have an intrinsic
$p_i$ except in the presence of a particular model-generation technique. Thus, the $p_i$'s are associated
not only with a given set of models, but also with a specific model-generation technique. As a
practical matter, however, one must always have a formalism through which to generate models, so
it is sufficient to associate the $p_i$'s with the models themselves, as we have done with this ball/urn
analogy. We also note that even if there is a bias within the model-generation technique, so that
the parameter space in Fig. 1 is not explored truly randomly, this effect can also be incorporated
within the probabilities $p_i$ so long as each parameter combination is explored at least once. Thus,
the methods that we shall be developing for overcoming production biases can overcome this type
of bias as well.

We are now in a position to address what will happen as this model space is
explored. By definition, the probability of drawing a ball from a given population is
directly related to the volume occupied in the probability space by that particular
population. For a given population $i$, this probability is $P_i = p_i N_i/\Omega_{\rm prob}$. However,
the probability of finding a new, previously unseen ball within the population $i$ will
also depend on the number $x_i$ of distinct balls from population $i$ which have already
been found:

$$ \hat{P}_i = \frac{p_i}{\Omega_{\rm prob}} \, (N_i - x_i) . \tag{3.6} $$

Here the notation $\hat{P}$ (rather than $P$) indicates that the probability $\hat{P}$ in Eq. (3.6) is
entirely unrestricted, i.e., there is no prior assumption that the draw will even select
a model from the $i$-population. Using this equation, we can follow our previous steps
to calculate the expected number of distinct $i$-models $\langle M_i(D)\rangle$ that will be found
after $D$ draws from the urn. Our recursion relation takes the form



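$$ \langle M_i(D+1)\rangle = \langle M_i(D)\rangle \left( 1 - \frac{p_i}{\Omega_{\rm prob}} \right) + \frac{N_i\, p_i}{\Omega_{\rm prob}} , \tag{3.7} $$

and with the initial condition $\langle M_i(0)\rangle = M_i(0) = 0$ we find the solution

$$ \langle M_i(D)\rangle = N_i \left[ 1 - \left( 1 - \frac{p_i}{\Omega_{\rm prob}} \right)^{D} \right] \approx N_i \left( 1 - e^{-D p_i/\Omega_{\rm prob}} \right) . \tag{3.8} $$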



Note that the prefactor no longer matches the factor in the exponential. 
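
The mismatch between prefactor and exponent can be checked numerically. The Python sketch below iterates the recursion $\langle M_i(D+1)\rangle = \langle M_i(D)\rangle (1 - p_i/\Omega_{\rm prob}) + p_i N_i/\Omega_{\rm prob}$ from $\langle M_i(0)\rangle = 0$ and compares it with the closed form $N_i[1-(1-p_i/\Omega_{\rm prob})^D]$. It is an illustration under assumed inputs: the function names are ours, and the two populations used in the usage note are hypothetical.

```python
def omega_prob(sizes, intrinsic):
    """Total probability-space volume: sum_i p_i N_i."""
    return sum(p * n for p, n in zip(intrinsic, sizes))

def expected_found_closed(sizes, intrinsic, n_draws):
    """Closed form N_i [1 - (1 - p_i/Omega_prob)^D] for each population i."""
    omega = omega_prob(sizes, intrinsic)
    return [n * (1.0 - (1.0 - p / omega) ** n_draws)
            for p, n in zip(intrinsic, sizes)]

def expected_found_recursive(sizes, intrinsic, n_draws):
    """Iterate the one-step recursion starting from <M_i(0)> = 0."""
    omega = omega_prob(sizes, intrinsic)
    found = [0.0] * len(sizes)
    for _ in range(n_draws):
        found = [m * (1.0 - p / omega) + p * n / omega
                 for m, p, n in zip(found, intrinsic, sizes)]
    return found
```

For instance, with `sizes = [50, 5000]` and `intrinsic = [1.0, 0.001]`, a few hundred draws nearly exhaust the first population while barely touching the second, even though the second contains a hundred times as many models.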

Eqs. (3.6) and (3.8) give us a general sense of when different populations of models
will be found. The populations with the largest $p_i$ will begin to be explored first
simply because they have the larger probabilities of being selected. This will then
increase $x_i$, which in turn makes it more difficult to find new models in this population.
Subsequent new models that are found will then start to preferentially come from
populations with $p_j < p_i$. Indeed, only when

$$ \frac{N_i - x_i}{N_j - x_j} = \frac{p_j}{p_i} \tag{3.9} $$

will the probability of drawing a new model from the model space of population $j$ be
equal to that of drawing a new model from the model space of population $i$. Thus,
the exploration of model spaces with smaller $p_i$ will always lag the exploration of
spaces of models with larger $p_i$.

Note that Eq. (3.8) describes the growth of the individual quantities $\langle M_i(D)\rangle$ as
functions of the number $D$ of draws. For this purpose, any selection from the urn
counts as a draw. However, for some purposes, it is also useful to define the restricted
draw $d_i$ which denotes the number of times a ball from population $i$ is drawn (again
regardless of whether this ball has previously been seen). Each time $D$ increases by
one, we can be certain that one and only one of the $d_i$ increases by one because our
probability populations are disjoint. However, the expectation value $\langle d_i\rangle$ will be given
by

$$ \langle d_i\rangle = \frac{p_i N_i}{\Omega_{\rm prob}} \, D . \tag{3.10} $$

Using $\langle d_i\rangle$, we can therefore rewrite Eq. (3.8) in the form

$$ \langle M_i(D)\rangle = N_i \left[ 1 - \left( 1 - \frac{1}{N_i} \right)^{\langle d_i\rangle} \right] \approx N_i \left( 1 - e^{-\langle d_i\rangle/N_i} \right) . \tag{3.11} $$

Of course, as expected, these results have the same forms as Eqs. (3.3) and (3.4)
since the use of the restricted draw $\langle d_i\rangle$ allows us to consider each population as
truly separate in the drawing process.

Even at this stage, we have still not completely modelled the string model- 
exploration process. This is because we cannot assume that the physical charac- 
teristics of a given model (such as its degree of supersymmetry, the rank or content 
of its gauge group, the chirality of its spectrum, or its number of fermion generations) 
are correlated in any way with its probability of being drawn. Or, to continue with 
our analogy of the balls in the urn, even though the plastic balls may be smaller
or lighter than the rubber balls (thereby giving the plastic balls a smaller intrinsic
probability $p_i$ of being drawn than the rubber balls), the physical characteristics of
the string model may correspond to a completely independent variable such as the
color of the ball. Some balls may be red and some balls may be blue, and we have
no reason to assume that all red balls are plastic or that all blue balls are rubber.
In the following, therefore, we shall continue to let the composition $i$ of the balls
represent their probabilities of being selected, but we shall also let the color $\alpha$ of
the ball (red, blue, etc.) denote its physical characteristics. This is consistent with
the conventions in Fig. 2, where different colors/shadings denote different physical
characteristics while size rescalings denote different probabilities of being drawn.

Note that while the different probability populations are necessarily disjoint, the
physical characteristic classes need not be disjoint at all. For example, two classes
$\alpha$ and $\beta$ may have a partial overlap, such as would occur if characteristic $\alpha$ denotes
the presence of an SU(3) gauge-group factor while $\beta$ denotes the presence of $\mathcal{N}=1$
spacetime SUSY; alternatively, one class may be a subset of another, as would occur if
$\alpha$ denotes the presence of SU(3) while $\beta$ denotes the presence of the entire
Standard-Model gauge group. All that is required in our formalism is that each class correspond
to a set of models exhibiting a well-defined set of particular physical characteristics.

Given this, the populations will generally fill out a population matrix $N_{\alpha i}$.
Moreover, given this population matrix, it is then straightforward to determine the average
expected numbers of distinct models with particular sets of physical characteristics:

$$ \langle M_\alpha(D) \rangle = \sum_i N_{\alpha i} \left[ 1 - \left( 1 - \frac{p_i}{N_{\rm prob}} \right)^{D} \right] \approx \sum_i N_{\alpha i} \left( 1 - e^{-D p_i / N_{\rm prob}} \right) . \qquad (3.12) $$

Thus, as we draw balls from the urn and note their physical properties, we expect the
numbers of distinct models exhibiting particular physical properties to grow as a sum
of weighted exponentials, where each exponential is weighted by its own population
size $N_{\alpha i}$ and where the "time constant" for each exponential is related to a unique
probability fraction $p_i$.
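As a rough numerical illustration of Eq. (3.12), the sum of weighted exponentials can
be evaluated directly for a small population matrix. This is a sketch only: the helper
name and all numbers below are hypothetical, and the total population sizes $N_i$ are
passed separately because the characteristic classes need not be disjoint.

```python
import math


def expected_M_alpha(row, p, pop_sizes, D):
    """Exponential form of Eq. (3.12) for one characteristic.

    row[i]       -- N_{alpha i}, models with the characteristic in population i
    p[i]         -- intrinsic probability weight of population i
    pop_sizes[i] -- total size N_i of population i (enters N_prob)
    """
    n_prob = sum(pi * ni for pi, ni in zip(p, pop_sizes))
    return sum(n_ai * (1.0 - math.exp(-D * pi / n_prob))
               for n_ai, pi in zip(row, p))


if __name__ == "__main__":
    p = [1.0, 3.0]              # illustrative weights
    pop_sizes = [2000, 1000]    # illustrative N_i
    row_alpha = [500, 100]      # alpha sits mostly in the hard-to-reach population
    row_beta = [100, 500]       # beta sits mostly in the easy-to-reach population
    for D in (10, 1000, 10000, 1000000):
        ratio = (expected_M_alpha(row_alpha, p, pop_sizes, D)
                 / expected_M_alpha(row_beta, p, pop_sizes, D))
        print(D, round(ratio, 3))
```

With these made-up numbers the two classes have equal true populations (600 each),
yet the expected census ratio starts near 0.5 and only approaches 1 once the model
space is saturated.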

Clearly, each different physical characteristic $\alpha$ can be expected to have its own
unique pattern of growth for $\langle M_\alpha(D) \rangle$ as a function of $D$. However, it may
occasionally happen that two different physical characteristics $\alpha$ and $\beta$ will nevertheless give
rise to quantities $\langle M_\alpha(D) \rangle$ and $\langle M_\beta(D) \rangle$ which share the same overall behavior as
functions of their arguments, with at most only an overall rescaling between them.
In such cases, we shall say that $\alpha$ and $\beta$ are in the same universality class. It is
straightforward to see that if

$$ \frac{N_{\alpha i}}{N_{\beta i}} = \frac{N_{\alpha j}}{N_{\beta j}} \quad \text{for all } (i,j) , \qquad (3.13) $$

i.e., if the $\alpha$-row of the population matrix is a multiple of the $\beta$-row, then $\alpha$ and
$\beta$ will be in the same universality class. Indeed, in such cases, the $\alpha$-characteristic
need not be correlated with the probability deformations, but the $\alpha$- and $\beta$-model
spaces nevertheless experience identical deformations. Phrased slightly differently,
this means that although the $\alpha$-model subspace experiences a non-trivial deformation
in passing to the corresponding probability space, the $\beta$-model subspace experiences
exactly the same deformation.

It turns out that Eq. (3.13) is not the most general condition which guarantees
that $\alpha$ and $\beta$ are in the same universality class, since we can also have situations in
which there exist (subsets of) intrinsic probabilities $p_i$ such that $p_i/p_k = p_j/p_\ell$. In
such cases, we do not need to demand the strict condition in Eq. (3.13), but rather
the more general condition

$$ \frac{N_{\alpha i}}{N_{\beta k}} = \frac{N_{\alpha j}}{N_{\beta \ell}} \quad \text{for all sets } (i,j,k,\ell) \text{ for which } \frac{p_i}{p_k} = \frac{p_j}{p_\ell} . \qquad (3.14) $$

We shall therefore take this to be our most general definition for when two physical
characteristics $\alpha$ and $\beta$ are in the same universality class. However, it is easy to see
that even when $p_i/p_k = p_j/p_\ell$ has no solutions with $i \neq k$ and $j \neq \ell$, there will always
exist the trivial solution with $i = k$ and $j = \ell$. In this case, Eq. (3.14) reduces back
to Eq. (3.13).
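When a population matrix is known (as in a toy simulation), the row-proportionality
condition of Eq. (3.13) is easy to test. The sketch below is illustrative only: the
function name is hypothetical, it checks only the sufficient condition (3.13) rather
than the more general condition (3.14), and it uses cross-multiplication so that
integer populations are compared exactly and zero entries cause no division problems.

```python
def rows_proportional(row_alpha, row_beta):
    """True if N_{alpha i} / N_{beta i} = N_{alpha j} / N_{beta j} for all (i, j),
    tested as N_{alpha i} * N_{beta j} == N_{alpha j} * N_{beta i}."""
    n = len(row_alpha)
    return all(row_alpha[i] * row_beta[j] == row_alpha[j] * row_beta[i]
               for i in range(n) for j in range(i + 1, n))


if __name__ == "__main__":
    print(rows_proportional([200, 100, 50], [40, 20, 10]))  # proportional rows
    print(rows_proportional([200, 100, 50], [40, 20, 30]))  # not proportional
```

In a real search the population matrix is of course unknown, so a check of this kind
is useful only for validating simulations.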

Regardless of the relations between the different physical characteristics $\alpha$, the
fundamental problem that concerns us can be summarized as follows. As we construct
model after model, we can keep a running tally of $M_\alpha(D)$ for each relevant physical
characteristic $\alpha$ (or for each relevant combined set of characteristics $\alpha$). Equivalently,
this information may be expressed as $M_\alpha(d_\alpha)$, where we express the number of models
$M_\alpha$ as a function of $d_\alpha$, the number of draws which have yielded an $\alpha$-model
regardless of whether that model has previously been seen. Ultimately, on the
basis of this information, our goal is to determine correlations between these sets of
characteristics across the entire landscape, i.e., we wish to determine the values of
ratios such as $N_\alpha/N_\beta$, where $N_\alpha \equiv \sum_i N_{\alpha i}$. However, we now see that we face two
fundamental hurdles:

• First, it is not possible in practice to determine $N_\alpha/N_\beta$ from the $\langle M_\alpha(D) \rangle$ or
$\langle M_\alpha(d_\alpha) \rangle$ because we do not have prior information about the partial population
matrix $N_{\alpha i}$ or the individual probabilities $p_i$, both of which enter into Eq. (3.12).
Indeed, even if we were willing to do a numerical fit and had sufficient statistical
data with which to conduct it, we do not even know the number of distinct
exponentials which enter into the sums in Eq. (3.12), and it is always possible
to improve accuracy (and thereby dramatically change the resulting best-fit
values for the $N_{\alpha i}$) simply by introducing additional exponentials into the sum.

• Second, even if we could solve the mathematical problem of extracting $N_\alpha/N_\beta$
from $\langle M_\alpha(D) \rangle$ or $\langle M_\alpha(d_\alpha) \rangle$, we do not know to what extent $\langle M_\alpha(D) \rangle$ or
$\langle M_\alpha(d_\alpha) \rangle$ can be taken to approximate the exact, discrete integers $M_\alpha(D)$ or
$M_\alpha(d_\alpha)$ that are actually measured. Clearly, we expect that this approximation
should become very good as we explore sufficiently large portions of the
corresponding entire model spaces, but we cannot a priori determine when this
approximation might actually be valid because we do not know the absolute
populations $N_{\alpha i}$ of these spaces.

Thus, it would clearly be an error to assume that $N_\alpha/N_\beta$ can be identified as
$M_\alpha(D)/M_\beta(D)$ for any particular value of $D$ (unless, of course, we have already
saturated the model space, with $D \gg N_{\alpha,\beta}$). Indeed, if we were to make this error,
we would find that our proposed ratio $N_\alpha/N_\beta$ would "float", i.e., it would evolve
as a function of $D$. This is, ultimately, the problem of floating correlations. This
behavior is illustrated in Fig. 3, which shows the results of an actual simulation in the
simple case in which there are only two populations in each variable ($\alpha$ and $i$) and
where the population matrix $N_{\alpha i}$ is diagonal. Even in this dramatically simplified
case, we see that our observed ratios of models float dramatically as a function of
sample size, reaching the true value only when the full model space has been explored.
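The float illustrated in Fig. 3 can be reproduced qualitatively with a small toy
simulation. The sketch below is not the simulation used for the figure: it assumes a
much smaller model space than the 310,000 models quoted in the caption, a diagonal
2x2 population matrix, and hypothetical function and parameter names. It records the
observed fraction of "good" models each time the number of distinct models found
reaches a chosen checkpoint.

```python
import random


def floating_fraction(n_good, n_other, gamma, checkpoints, seed=1):
    """Return {distinct-model count: observed fraction of 'good' models}.

    gamma = p_good / p_other is the model-generation bias; p_other is
    normalized to 1. Draws are made with replacement until the largest
    checkpoint (a number of distinct models) has been reached.
    """
    rng = random.Random(seed)
    w_good = gamma * n_good        # total weight of the 'good' population
    w_total = w_good + n_other     # total weight of the whole urn
    seen_good, seen_other = set(), set()
    targets = sorted(checkpoints)
    observed = {}
    k = 0
    while k < len(targets):
        if rng.random() * w_total < w_good:
            seen_good.add(rng.randrange(n_good))
        else:
            seen_other.add(rng.randrange(n_other))
        total = len(seen_good) + len(seen_other)
        if total >= targets[k]:
            observed[targets[k]] = len(seen_good) / total
            k += 1
    return observed


if __name__ == "__main__":
    # One third of the models are 'good'; they are three times harder to draw.
    for size, frac in floating_fraction(1000, 2000, 1 / 3,
                                        [300, 1500, 2700, 3000]).items():
        print(size, round(frac, 3))
```

In such a run the observed fraction of "good" models stays well below the true value
of 1/3 until the model space is almost exhausted, and reaches 1/3 only at saturation.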

Having described the problem in mathematical terms, we shall now propose a 
solution. The solution is relatively simple in principle, but its proper implementation 
is somewhat subtle. We shall therefore defer a discussion of its implementation to 
the next section. 

We shall begin by concentrating on the simplest case in which the population
matrix $N_{\alpha i}$ is diagonal. In this case, all physical properties of interest are perfectly
correlated with the different probability deformations. Thus, we can identify the
$\alpha$-population with some value $i$, the $\beta$-population with some value $j$, and so forth.

Figure 3: Results of an actual numerical simulation showing how a correlation can "float"
as the model space is explored. For this simple example, we have assumed only two disjoint
populations of models with different intrinsic probabilities and likewise assumed only two
physical characteristics (denoted "good" and "other"). Moreover, we assumed a complete
correlation between probabilities and physical characteristics, so that our 2x2 population
matrix is diagonal. We assumed a model space consisting of 310,000 models, one third of
which are designated "good"; likewise, this simulation was run repeatedly with different
probability ratios $\gamma \equiv p_{\rm good}/p_{\rm other}$ reflecting the intrinsic bias of our model-generation
procedure. Despite all of these simplifying assumptions, we see that our correlations float
very strongly as a function of the sample size of distinct models found, with the ratio of
the numbers of "good" models to total models approaching the true value (=1/3) only
when the total model space is explored. We also see that our statistics from this random
simulation do not follow any semblance of smooth behavior until we have examined at least
20,000 distinct models, representing approximately 6% of the total model space.

Our goal is to generate a value representing $N_\alpha/N_\beta$ for some pre-determined (sets
of) physical characteristics $\alpha$ and $\beta$. However, all we can do is make repeated draws
from the urn, slowly developing tallies $M_{\alpha,\beta}(D)$ or $M_{\alpha,\beta}(d_{\alpha,\beta})$ of the distinct models
in these respective classes. As we continue in this process of drawing from the urn,
it becomes increasingly difficult to find new, hitherto-unseen models in each class.
Indeed, viewing these classes as entirely separate, we see that the probability that
each model selected from class $\alpha$ or class $\beta$ will not have previously been seen is



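$$ P^{(\alpha)}_{\rm new}(d_\alpha) = 1 - \frac{M_\alpha(d_\alpha)}{N_\alpha} , \qquad P^{(\beta)}_{\rm new}(d_\beta) = 1 - \frac{M_\beta(d_\beta)}{N_\beta} . \qquad (3.15) $$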



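These probabilities can be taken as measures of how fully a given model space is
explored. Therefore, rather than attempt to identify

$$ \frac{N_\alpha}{N_\beta} = \frac{M_\alpha(D)}{M_\beta(D)} \qquad (3.16) $$

for any single value of $D$, our solution is to instead identify

$$ \frac{N_\alpha}{N_\beta} = \left. \frac{M_\alpha(d'_\alpha)}{M_\beta(d'_\beta)} \right|_{\, P^{(\alpha)}_{\rm new}(d'_\alpha) \,=\, P^{(\beta)}_{\rm new}(d'_\beta)} \qquad (3.17) $$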



for two different draw values $d'_\alpha$ and $d'_\beta$ which are chosen such that their respective
production probabilities are equated. Note that while the quantity $d'_\alpha$ will correspond
to a certain total draw count $D'$, the quantity $d'_\beta$ will generally correspond to a
different total draw count $D''$. In other words, we do not extract the desired ratio
$N_\alpha/N_\beta$ by comparing $M_\alpha(D)$ and $M_\beta(D)$ at the same simultaneous point in the search
process; rather, we compare the value of $M_\alpha$ measured at one point in the search
process (i.e., after $D'$ total draws) with the value of $M_\beta$ measured at a different point
in the process (i.e., after $D''$ total draws). As indicated in the condition in Eq. (3.17),
these different points are related by the fact that they correspond to points at which
the corresponding $\alpha$- and $\beta$-model spaces are equally explored. This then completely
overcomes the biases that result from the fact that the different model spaces are
generally being explored at different rates.

Of course, in the process of randomly generating string models, we cannot normally
control whether a random new model is of the $\alpha$- or $\beta$-type. Both will tend
to be generated together, as part of the same random search. Thus, if $D'' > D'$, our
procedure requires that we completely disregard the additional $\alpha$-models that might
have been generated in the process of generating the required, additional $\beta$-models.
This is the critical implication of Eq. (3.17). Rather than let our model-generating
procedure continue for a certain duration, with statistics gathered at the finish line,
we must instead establish two separate finish lines for our search process. Of course,
these finish lines are arbitrary and must be chosen such that their respective $\alpha$- and
$\beta$-production probabilities are equated. However, these finish lines will not generally
coincide with each other, which requires that some data actually be disregarded in
order to extract meaningful statistical correlations.

Thus far, we have been describing the situation in which we seek to obtain statistics
comparing only two different groups of physical characteristics $\alpha$ and $\beta$. In
general, however, we might wish to compare whole sets of physical characteristics
$\{\alpha, \beta, \gamma, ...\}$. Our procedure then requires that we establish a whole host of
correlated finish lines, one for each set of physical characteristics, and use Eq. (3.17) to
make pairwise comparisons.

Given Eq. (3.15), the result in Eq. (3.17) follows quite trivially from the condition
that $P^{(\alpha)}_{\rm new}(d'_\alpha) = P^{(\beta)}_{\rm new}(d'_\beta)$. The simplicity of this statement may even seem to
be a tautology, and indeed the difficulty in extracting the desired correlation ratio
$N_\alpha/N_\beta$ is now reduced to the practical question of determining when the probability
condition in Eq. (3.17) is satisfied. This will be the focus of the next section.
However, the important point is that we can overcome all of the biases inherent in the
model-generation process by focusing on the probabilities for generating new distinct
models, and by comparing the numbers of models which have emerged at different
points in the model-generation process: points at which these respective production
probabilities are equal.
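For completeness, the short step from Eq. (3.15) to Eq. (3.17) can be written out
explicitly. The lines below simply restate the argument of the text, assuming the form
of the production probabilities given in Eq. (3.15) for the diagonal case:

```latex
\begin{align*}
P^{(\alpha)}_{\rm new}(d'_\alpha) = P^{(\beta)}_{\rm new}(d'_\beta)
  &\;\Longrightarrow\;
  1 - \frac{M_\alpha(d'_\alpha)}{N_\alpha} = 1 - \frac{M_\beta(d'_\beta)}{N_\beta} \\
  &\;\Longrightarrow\;
  \frac{M_\alpha(d'_\alpha)}{N_\alpha} = \frac{M_\beta(d'_\beta)}{N_\beta} \\
  &\;\Longrightarrow\;
  \frac{N_\alpha}{N_\beta} = \frac{M_\alpha(d'_\alpha)}{M_\beta(d'_\beta)} .
\end{align*}
```

In words, equal production probabilities mean that the same fraction of each model
space has already been found, so the ratio of the tallies equals the ratio of the
full populations.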

As indicated above, Eq. (3.17) has been derived for the simple case in which the
population matrix $N_{\alpha i}$ is diagonal. However, as long as $\alpha$ and $\beta$ are in the same
universality class, it turns out that this result also holds for the more general case in
which our populations $\alpha$ and $\beta$ are non-trivially distributed across different
probabilities $p_i$. This statement is proven in the Appendix, and as we shall see below, this case
actually covers a large fraction of physically interesting characteristics. Thus, even
in this case, we can overcome the biases inherent in the model-generation process by
focusing on the probabilities for generating new distinct models at different points in
the model-generation process.

4 Equating probabilities, and the uses of attempts/model 

The fundamental task that remains is to develop a method of measuring the
restricted probabilities $P^{(\alpha,\beta)}_{\rm new}(d_{\alpha,\beta})$ which appear in Eq. (3.17), or at least to develop
a method of determining when these probabilities are equal. At first glance, it might
seem that this should be a relatively simple undertaking. Since we naturally generate
data such as $M_{\alpha,\beta}(d_{\alpha,\beta})$ in the course of our model search, it might seem that we
could determine the individual model-production probabilities simply by taking a
derivative:

$$ P^{(\alpha,\beta)}_{\rm new}(d_{\alpha,\beta}) = \frac{d M_{\alpha,\beta}(d_{\alpha,\beta})}{d \, d_{\alpha,\beta}} . \qquad (4.1) $$



Unfortunately, it turns out that taking such a derivative is computationally
unfeasible. The reason is that whereas the theoretical expectation value $\langle M_\alpha(D) \rangle$ is a
smooth, continuous function, the actual "measured" quantities $M_\alpha(D)$ are necessarily
discrete, jumping from integer to integer at unpredictable values of $D$ or $d_\alpha$. Of
course, one could perhaps extract $\langle M(D) \rangle$ by repeating the same model-generation
process over and over and averaging the results, but this is computationally expensive
and redundant: hardly an efficient solution for a problem which has only arisen in
the first place because our computational power is already stretched to the maximum
extent.

Indeed, the overall problem is that the production probabilities $P^{(\alpha)}_{\rm new}(D)$, which
are the only true legitimate measure of the degree to which a model space is explored,
fail to be a computationally practical measure because they are extremely sensitive
to the difference between $M_\alpha(D)$ and $\langle M_\alpha(D) \rangle$. What we require, by contrast, is an
alternative measure of the extent to which a given model space is explored, a measure
which may be only approximate but which is less sensitive to the difference between
$M(D)$ and $\langle M(D) \rangle$ and which can therefore be implemented in an actual automated
search through the model space.

To get an idea how to proceed, let us begin by considering the simplified case in
which the population matrix $N_{\alpha i}$ is actually diagonal. In this case, all physical
properties of interest are perfectly correlated with the different probability deformations,
so that we can identify the $\alpha$-population with some value $i$, the $\beta$-population with
some value $j$, and so forth. Thus we expect $\langle M_\alpha(D) \rangle$ (or equivalently $\langle M_i(D) \rangle$ for
some $i$) to follow Eqs. (3.8) and (3.11). Given these equations, it then follows that

$$ \frac{N_i}{N_j} = \frac{\langle M_i(D') \rangle}{\langle M_j(D'') \rangle} \qquad (4.2) $$

only when we satisfy the balancing condition

$$ \frac{D' p_i}{N_{\rm prob}} = \frac{D'' p_j}{N_{\rm prob}} \qquad (4.3) $$

or equivalently

$$ \frac{\langle d'_i \rangle}{N_i} = \frac{\langle d''_j \rangle}{N_j} . \qquad (4.4) $$



Since we do not know the values of the $p_i$ or the $N_i$, it is not possible to determine
the balanced pairs of values $(D', D'')$ or $(\langle d'_i \rangle, \langle d''_j \rangle)$ using these equations. However,
since Eq. (4.4) implies Eq. (4.2), we can multiply the two sides of Eq. (4.4) by $N_i/\langle M_i(d'_i) \rangle$
and $N_j/\langle M_j(d''_j) \rangle$ respectively to obtain the equivalent balancing condition

$$ \frac{\langle d'_i \rangle}{\langle M_i(d'_i) \rangle} = \frac{\langle d''_j \rangle}{\langle M_j(d''_j) \rangle} . \qquad (4.5) $$

Unlike Eq. (4.3), this balancing equation is easy to interpret and implement in a
computer search since $\langle d_i \rangle / \langle M_i(d_i) \rangle$ is nothing but the expected ratio
of 'attempts' to 'models', where 'attempts' refers to the total number of $i$-models
drawn and 'models' refers to the total number of actual distinct $i$-models drawn.
Thus, we can view our balancing condition as one which equates cumulative attempts
per model, where our attempts are restricted to those which yielded a model (whether
distinct or not) in the appropriate class.
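Keeping track of attempts per model during a search requires only two running
tallies per class: the restricted draw count and the set of distinct models seen. The
following is a minimal bookkeeping sketch; the class name, method names, and the use of
hashable model identifiers are all assumptions made for illustration, and in a real
search the identifier would have to encode whatever criterion defines physically
distinct models.

```python
from collections import defaultdict


class AttemptsPerModel:
    """Running tally of restricted draws d_c and distinct models M_c per class c."""

    def __init__(self):
        self.attempts = defaultdict(int)  # d_c: every draw that lands in class c
        self.models = defaultdict(set)    # distinct model identifiers seen in c

    def record(self, cls, model_id):
        """Register one generated model (new or repeated) in class cls."""
        self.attempts[cls] += 1
        self.models[cls].add(model_id)

    def ratio(self, cls):
        """Cumulative attempts per model, d_c / M_c(d_c), for class cls."""
        if cls not in self.models or not self.models[cls]:
            raise ValueError("no models recorded yet for class %r" % (cls,))
        return self.attempts[cls] / len(self.models[cls])


if __name__ == "__main__":
    tally = AttemptsPerModel()
    for cls, model_id in [("a", 1), ("a", 1), ("a", 2), ("b", 7)]:
        tally.record(cls, model_id)
    print(tally.ratio("a"), tally.ratio("b"))
```

A model with several characteristics can simply be recorded once in each class to
which it belongs, since the classes need not be disjoint.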

Of course, it may initially seem that attempts/model is no better than production
probabilities since they are both essentially equivalent when the population matrix
is diagonal or has rescaled rows. However, the important point is that since
attempts/model does not involve a derivative of $M(D)$, this quantity is actually less
sensitive to the difference between $M(D)$ and $\langle M(D) \rangle$ than the production
probabilities $P_{\rm new}$. Thus, we may replace

$$ \frac{\langle d_i \rangle}{\langle M_i(d_i) \rangle} \;\longrightarrow\; \frac{d_i}{M_i(d_i)} \qquad (4.6) $$

in Eq. (4.5) without seriously damaging our ability to extract the desired ratio $N_i/N_j$
(or $N_\alpha/N_\beta$). This fact is illustrated in Fig. 4.

Figure 4: Results of a numerical simulation involving the same setup as in Fig. 3, but
now with $N_{\rm good}/N_{\rm all}$ extracted through Eq. (4.7) and plotted as a function of the value
of $d_{\rm good}/M_{\rm good}(d_{\rm good}) = d_{\rm other}/M_{\rm other}(d_{\rm other})$. As we see, use of this method enables us to
extract the correct value $N_{\rm good}/N_{\rm all} = 1/3$ with considerable accuracy even for relatively
small values of attempts/model, regardless of the value of the bias $\gamma$. We also ran similar
simulations in which each of the models in the model space was subjected to an additional
arbitrary probability deformation; as long as the condition in Eq. (3.14) was enforced, the
resulting plot remained essentially unchanged.



This result is valid for the case when the population matrix is diagonal. However,
it is straightforward to see that these results also hold for any $\{\alpha, \beta, ...\}$ which are in
the same universality class [as defined in Eq. (3.14)]. Because the probability spaces
corresponding to models with each of these characteristics have identical deformation
patterns, we can repeat the above derivation and find that attempts/model continues
to be a fairly accurate measure parametrizing the degree to which a given model space
is explored. As it turns out, many physical characteristics of interest $\{\alpha, \beta, ...\}$ have
the property that they share identical probability deformations for a given
model-construction formalism, and are thus in the same universality class. Thus, for these
characteristics, attempts/model can be used in place of production probabilities in
extracting population ratios:

$$ \frac{N_\alpha}{N_\beta} = \left. \frac{M_\alpha(d'_\alpha)}{M_\beta(d''_\beta)} \right|_{\, d'_\alpha / M_\alpha(d'_\alpha) \,=\, d''_\beta / M_\beta(d''_\beta)} . \qquad (4.7) $$



Indeed, we can "experimentally" verify whether our chosen physical characteristics $\alpha$
and $\beta$ are in the same universality class by calculating the ratio $N_\alpha/N_\beta$ as a function
of the chosen number of attempts/model using this relation, and verifying that this
ratio does not experience any float as a function of attempts/model. The absence of
any float indicates that the physical characteristics $(\alpha, \beta)$ are in the same universality
class. We shall see explicit examples of this situation in Sect. 5.
One important cross-check is to verify that

$$ \frac{N_\alpha}{N_\beta} = \frac{N_\alpha}{N_\gamma} \, \frac{N_\gamma}{N_\beta} \qquad (4.8) $$

for all $(\alpha, \beta, \gamma)$ in the same universality class, where each of these fractions is
individually extracted through Eq. (4.7). Since Eq. (4.8) is not guaranteed to hold on
the basis of the definition in Eq. (4.7), its validity provides an important check on
any results we obtain.
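The extraction procedure of Eq. (4.7) can be sketched on the same kind of toy urn used
for Fig. 3. Everything below is an illustrative assumption rather than the procedure
actually used for the figures: the function names, the model-space sizes, the bias, and
in particular the `min_models` guard, which is introduced here only to skip the noisy
first few draws in which attempts/model can cross the target by chance.

```python
import random


def search_histories(sizes, weights, n_draws, seed=2):
    """Simulate a biased search over disjoint classes (diagonal population matrix).

    sizes[c]   -- number of distinct models N_c in class c
    weights[c] -- relative probability of drawing any one model of class c
    Returns, for each class, the list of (d_c, M_c) after each draw in that class.
    """
    rng = random.Random(seed)
    classes = list(sizes)
    class_weights = [sizes[c] * weights[c] for c in classes]
    seen = {c: set() for c in classes}
    hist = {c: [] for c in classes}
    for _ in range(n_draws):
        c = rng.choices(classes, weights=class_weights)[0]
        seen[c].add(rng.randrange(sizes[c]))
        hist[c].append((len(hist[c]) + 1, len(seen[c])))
    return hist


def models_at(hist_c, target, min_models=50):
    """Distinct-model count at the first point where attempts/model >= target."""
    for d, m in hist_c:
        if m >= min_models and d / m >= target:
            return m
    raise ValueError("target attempts/model was not reached in this class")


def ratio_estimate(hist, a, b, target):
    """Estimate N_a / N_b by comparing tallies at equal attempts/model."""
    return models_at(hist[a], target) / models_at(hist[b], target)


if __name__ == "__main__":
    sizes = {"good": 5000, "other": 10000}   # true ratio N_good/N_other = 0.5
    weights = {"good": 1.0, "other": 3.0}    # 'good' models are harder to draw
    hist = search_histories(sizes, weights, 60000)
    for target in (1.2, 1.5, 2.0):
        print(target, round(ratio_estimate(hist, "good", "other", target), 3))
```

Note that the two classes reach a given value of attempts/model at different points
in the search, so the tallies being compared come from two different finish lines.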

It is important to note that this procedure only yields a set of relative abundances
of the form $N_\alpha/N_\beta$ within the same universality class. This is usually the best one
can do. However, it is occasionally possible to convert this information to absolute
proportions of the form $N_\alpha/N_{\rm all}$. For example, if the characteristics $\{\alpha, \beta, \gamma, ...\}$ are
non-overlapping, all in the same universality class, and happen to span the entire
space of possible physical characteristics, then $N_{\rm all} = N_\alpha + N_\beta + N_\gamma + ...$ and we can
therefore extract the absolute probabilities:

$$ \Omega_\alpha \equiv \frac{N_\alpha}{N_{\rm all}} = \frac{N_\alpha}{N_\alpha + N_\beta + N_\gamma + ...} = \left( 1 + \frac{N_\beta}{N_\alpha} + \frac{N_\gamma}{N_\alpha} + ... \right)^{-1} . \qquad (4.9) $$

Alternatively, we can sometimes avoid this procedure by simply letting $\bar\alpha$ denote the
complement of $\alpha$ (i.e., the characteristic that a given model does not contain the
characteristic associated with $\alpha$) and calculate $N_\alpha/N_{\bar\alpha}$. If $\alpha$ and $\bar\alpha$ happen to be in
the same universality class, then our result for $N_\alpha/N_{\bar\alpha}$ will be stable and $\Omega_\alpha$ can then
be extracted through Eq. (4.9), where we identify $N_{\bar\alpha} = N_\beta + N_\gamma + ...$. Calculating
$\Omega_\alpha$ for one member $\alpha$ of a given universality class will then enable us to obtain $\Omega_\beta$
for every other member $\beta$ of the class. However, we stress that this relies on the
assumption that $\alpha$ and $\bar\alpha$ are in the same universality class, a situation which is not
guaranteed to be the case.
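The conversion in Eq. (4.9) from relative abundances to an absolute proportion amounts
to a one-line computation. The sketch below uses a hypothetical function name and
made-up ratios, and it assumes (as in the text) that the listed characteristics are
non-overlapping and together span the whole model space.

```python
def absolute_fraction(ratios_to_alpha):
    """Eq. (4.9): Omega_alpha = 1 / (1 + sum over beta of N_beta / N_alpha).

    ratios_to_alpha -- iterable of the stable ratios N_beta/N_alpha, N_gamma/N_alpha, ...
    """
    return 1.0 / (1.0 + sum(ratios_to_alpha))


if __name__ == "__main__":
    # Illustrative input: N_beta/N_alpha = 2 and N_gamma/N_alpha = 1.
    omega_alpha = absolute_fraction([2.0, 1.0])
    omega_beta = omega_alpha * 2.0   # Omega_beta = Omega_alpha * (N_beta / N_alpha)
    print(omega_alpha, omega_beta)
```

When the complement $\bar\alpha$ is used instead, the list contains the single ratio
$N_{\bar\alpha}/N_\alpha$.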

Finally, of course, we may face the most general situation in which two physical
characteristics $\alpha$ and $\beta$ are not in the same universality class. In such cases, even
the ratios $N_\alpha/N_\beta$ determined through Eq. (4.7) will float as a function of the number
of attempts/model. Indeed, as mentioned above, this provides a test (indeed, the
only viable test) of whether two physical characteristics $\alpha$ and $\beta$ are truly in the
same universality class. However, even if $\alpha$ and $\beta$ are not in the same universality
class, it may nevertheless be possible to extract individual absolute probabilities $\Omega_\alpha$
and $\Omega_\beta$ through Eq. (4.9) if $\alpha$ and $\beta$ are in the same universality classes as $\bar\alpha$ and
$\bar\beta$ respectively. We can then indirectly calculate the relative probability
$N_\alpha/N_\beta = \Omega_\alpha/\Omega_\beta$.

Even when $\{\alpha, \beta, \gamma\}$ are not in the same universality class, the cross-check in
Eq. (4.8) must continue to hold. However, each individual fraction will not be stable
as a function of attempts/model unless it is determined indirectly through $\Omega_{\alpha,\beta,\gamma}$. For
example, let us imagine that $\alpha$ and $\beta$ are in the same universality class but $\gamma$ is in
a different universality class. In this case, we can use Eq. (4.7) to obtain each of the
fractions in Eq. (4.8), but the two factors on the right side of Eq. (4.8) will float as a
function of attempts/model, constrained only by the requirement that their product
be fixed. However, determining these factors through their absolute $\Omega$-probabilities
will enable stable results to be reached.

We see, then, that our solution to the problem of floating correlations involves 
more than simply tallying the populations of different models generated in a random 
search — it also requires information about how they were generated, and in par- 
ticular how many attempts at producing a distinctly new model are required before 
a given such model is actually found. While this represents new data which might 
not otherwise have received any special attention, we see that it is this new ingredi- 
ent which enables us to evaluate the degree to which a given model space has been 
explored. Moreover, it is relatively easy to keep track of this information during the 
model-generation process. 

In this connection, it is important to note that attempts/model can also have
additional important uses beyond Eq. (4.7). For example, attempts/model can be used
as a measure of the extent to which a given model space has been explored, even in
the presence of an unknown model-generation bias. Thus, use of attempts/model can
allow comparisons between model spaces of different (unknown) sizes. This property
is illustrated in Fig. 5, which shows that use of attempts/model can completely
eliminate the effects of differing model-space volumes. However, it is clear from these plots
that the existence of model-generation bias can continue to make a determination of
$N_{\rm good}/N_{\rm other}$ impossible until the model space is nearly fully explored. Indeed, we see
that the amount of model space which must be explored in order to overcome the
bias depends on the value of $\gamma$.

Figure 5: Top figures: Two different ways of presenting the same simulation results. In
each graph, we have plotted the results from three simulations in which the number of
models designated "good" (according to a pre-defined characteristic) is one-third of the
total. However, these simulations differ in the total size of the model space, with total
model-space volumes taken to be $V$, $V/5$, and $V/10$ where $V = 310000$. Each simulation
also incorporated a model-generation probability bias $\gamma = p_{\rm good}/p_{\rm other} = 1/3$. We see that
the observed ratio $N_{\rm good}/N_{\rm all}$ floats in each case, but plotting these results as a function of
sample size (as in the left figure) does not allow us to separate the effects of the bias from the
effects of the different volumes. However, plotting these results in terms of attempts/model
(as in the right figure) enables us to completely eliminate the effects of the differing
model-space volumes. Bottom figure: The results of the same simulation, but now with differing
bias ratios $\gamma$. We see that unlike the effects of differing volumes, the effects of bias cannot
be overcome simply by considering model spaces at similar levels of exploration.

5 A heterotic example

In this section, we shall illustrate the above ideas and their implementation in an 
actual example drawn from the heterotic string landscape. As we shall see, the use 
of these ideas leads to correlations that differ markedly from those which would have 
naively been apparent from only a partial data set. 

The models we shall examine are all four-dimensional perturbative heterotic string
models with $\mathcal{N}=1$ spacetime supersymmetry, formulated through the free-fermionic
construction ^3]. In the language of this construction, worldsheet conformal
anomalies are cancelled through the introduction of free fermions on the worldsheet,
and different models are realized by varying (or "twisting") the boundary
conditions of these fermions around the two non-contractible loops of the worldsheet
torus while simultaneously varying the phases according to which the contributions
of each such spin-structure sector are summed in producing the one-loop partition
function. For the purposes of our search, all worldsheet fermions were taken to be
complex with either Neveu-Schwarz (anti-periodic) or Ramond (periodic) boundary
conditions. However, we emphasize that alternative but equivalent languages for
constructing such models exist. For example, we may bosonize these worldsheet fermions
and construct "Narain" models [To"! [TOj in which the resulting complex worldsheet
bosons are compactified on internal lattices of appropriate dimensionality with
appropriate self-duality properties. Furthermore, many of these models have additional
geometric realizations as orbifold compactifications with randomly chosen Wilson
lines; in general, the process of orbifolding is quite complicated in these models,
involving many sequential overlapping layers of projections and twists.

A full examination of these statistical correlations for such $\mathcal{N}=1$ string models
will be presented in Ref. [T7|. Indeed, many of the methods behind our
model-generation techniques and subsequent statistical analysis are similar to those
described in Ref. j§] . However, our goal here is merely to provide an example of how
certain statistical correlations float, and how stable results can nevertheless be
extracted.

Towards this end, we shall restrict our attention to a simple question: with what
probabilities do certain gauge-group factors appear in the total (rank-22) gauge group
of such $\mathcal{N}=1$ string models? To address this question, we randomly constructed
a set of approximately 3.16 million distinct models in this class. This set of models is 25 times
larger than that examined in Ref. 0, and thus represents the largest set of distinct
heterotic string models which have ever been constructed to date. We emphasize
that the distinctness of these models is measured, as discussed in Sect. 2, on the
basis of their resulting physical characteristics in spacetime and not on the basis of
the internal worldsheet parameters from which they are derived.

One feature which is immediately apparent from such models is that while U(1)
and SU(2) gauge-group factors are fairly ubiquitous, SU(3) gauge-group factors are
relatively rare. Indeed, if we restrict our attention to the first 1.25 million models
that were generated in this set, we find that more than 90% of these models exhibit at
least one U(1) or SU(2) gauge-group factor, while fewer than approximately 50% of these models
exhibit an SU(3) gauge-group factor. Thus, we have what appears at first glance
to be a striking disparity: SU(3) gauge-group factors appear to be significantly less
likely to appear than U(1) or SU(2) gauge-group factors, at least in this perturbative
heterotic corner of the landscape.

However, an alternative explanation might simply be that our model-construction
technique (in this case, one involving free worldsheet complex fermions with only
periodic or anti-periodic worldsheet boundary conditions) has certain inherent
tendencies to produce models with U(1) or SU(2) gauge-group factors more easily
than to produce models with SU(3) gauge-group factors. Indeed, even though this
construction technique may ultimately be capable of producing more models with
SU(3) gauge-group factors than U(1) or SU(2) gauge-group factors (thereby causing
the SU(3) models to occupy a larger relative volume of the associated model space),
it may simply be that the models with SU(3) gauge-group factors are more
difficult to reach and thus occupy a smaller volume within the associated probability
space. If this is true, then we cannot hope to reach any conclusion about the
relative abundances of U(1), SU(2), and SU(3) gauge-group factors on the basis of a
straightforward census of the models we have generated.

Again, we emphasize that this is not a problem unique to the free-fermionic con- 
struction. Literally any construction procedure will have an intrinsic bias towards 
or against certain string models, yet this need not have anything to do with the ul- 
timate statistical properties across the corresponding model spaces. Thus, since we 
can examine at best only a necessarily finite sample of models, it is clear that we are 
not able to extract any meaningful information from a census study of a finite model 
sample alone. 

One clue that we are indeed dealing with a model-construction bias in this example 
comes from examining the percentage of models exhibiting an SU(3) gauge-group 
factor as a function of the number of distinct models we generated at different points 
in our search. These data are plotted in Fig. 6 for the first 1.25 million models, and it 
is immediately clear that the percentage of models with SU(3) gauge-group factors 
floats rather significantly as a function of the sample size. This implies that models 
with SU(3) gauge-group factors occupy a smaller relative volume within the 
probability space of models than within the true model space itself. We emphasize that 
this need not have been the case: it could have turned out that gauge groups were 
uniformly distributed among the populations of models with different probabilities 
of production. However, by examining gauge-group correlations as a function of the 
number of models generated, we now have clear evidence that this is not the case. 
Therefore, before we can draw any conclusions concerning the relative probabilities 
of specific gauge-group factors for such string models, we must compensate for this 
distortion of the probability space relative to the model space. 

Figure 6: The percentage of distinct four-dimensional $\mathcal{N}=1$ supersymmetric heterotic 
string models exhibiting at least one SU(3) gauge-group factor, plotted as a function of the 
number of models examined for the first 1.25 million models. We see that as we generate 
further models, SU(3) gauge-group factors become somewhat more ubiquitous; i.e., the 
fraction of models with this property floats. This implies that models with SU(3) 
gauge-group factors occupy a smaller relative volume within the probability space of models than 
within the true model space itself. One must therefore correct for this distortion of the 
probability space relative to the model space before any conclusions concerning the relative 
probabilities of specific gauge-group factors can be drawn. 
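
The kind of floating described here can be reproduced in a toy setting. The sketch below is a hypothetical two-class model and not the free-fermionic construction itself: `N_EASY` models lacking a given feature are each drawn with probability `P_EASY` per attempt, while `N_HARD` models carrying the feature are each drawn with a much smaller probability `P_HARD`. All names and numbers are illustrative assumptions, and the counts are expectation values rather than the output of an actual random search.

```python
def expected_distinct(n_models, p, draws):
    """Expected number of distinct models found after `draws` independent draws,
    when each of the `n_models` models is hit with probability `p` per draw."""
    return n_models * (1.0 - (1.0 - p) ** draws)


def feature_fraction(draws, n_easy, p_easy, n_hard, p_hard):
    """Fraction of the distinct models found so far that carry the feature."""
    m_easy = expected_distinct(n_easy, p_easy, draws)
    m_hard = expected_distinct(n_hard, p_hard, draws)
    return m_hard / (m_easy + m_hard)


# Hypothetical toy numbers: the feature is common in the model space (80%)
# but rare in the probability space (4%).
N_EASY, N_HARD = 1_000, 4_000
P_HARD = 1.0e-5
P_EASY = (1.0 - N_HARD * P_HARD) / N_EASY  # total probability sums to one

probability_share = N_HARD * P_HARD          # share seen at the start of a search
model_share = N_HARD / (N_EASY + N_HARD)     # share in the full model space

if __name__ == "__main__":
    for draws in (10, 1_000, 100_000, 10_000_000):
        frac = feature_fraction(draws, N_EASY, P_EASY, N_HARD, P_HARD)
        print(f"{draws:>10d} draws: feature fraction = {frac:.3f}")
```

In this toy model the census fraction starts near the probability-space share and drifts toward the model-space share only once the search has become very deep, which is the behavior called floating in the text.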

At first glance, it might seem from Fig. 6 that the proportion of models with 
SU(3) gauge-group factors is saturating somewhere near 50% or 60%. 
However, we must remember that the true size of the full model space is unknown. 
This means that even though the proportion of models with SU(3) seems to be 
floating very slowly, it is difficult to judge how long this floating might continue if we 
were able to examine more models. Even a small degree of floating could accumulate 
into a large change in the apparent frequency of SU(3) gauge factors. Moreover, we 
are generally concerned with correlations, i.e., relationships between two or more 
different variables. For example, we might be concerned with a correlation between 
the appearance of an SU(3) gauge-group factor and the appearance of an SU(2) 
gauge-group factor or a U(1) gauge-group factor. Since each individual gauge-group 
factor may experience its own degree of floating, the net float of the correlation can 
be quite strong even if these individual floats are rather weak. 
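
As a rough numerical illustration, with purely hypothetical numbers and with a simple ratio of two fractions standing in for a correlation measure (an assumption made only for this example), suppose the fraction $f_1$ of models with one factor floats up by 10% while the fraction $f_2$ of models with another factor floats down by 10%:

```latex
\frac{f_1}{f_2}: \qquad \frac{0.50}{0.50} = 1.00
\quad\longrightarrow\quad \frac{0.55}{0.45} \approx 1.22
```

Two individual floats of 10% thus combine into a shift of roughly 22% in the ratio.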

In order to address these difficulties, we can employ the methods outlined 
in the previous section. For example, we can let $\alpha$ represent the physical characteristic 
that a string model contains a U(1) gauge-group factor, $\beta$ represent the same for 
SU(2), $\gamma$ the same for SU(3), and so forth. If these $\alpha$, $\beta$, and $\gamma$ characteristics are in 
the same universality class [as defined in Eq. (3.14)], we can use Eq. (4.7) directly to 
extract $N_\alpha/N_\beta$, $N_\beta/N_\gamma$, and so forth. Indeed, calculating these ratios as a function of 
attempts/model, we can verify whether $\alpha$, $\beta$, and $\gamma$ are truly in the same universality 
class. Moreover, even when these characteristics are not in the same universality 
class, we can use the method outlined in Eq. (4.9) to obtain absolute probabilities $\Omega_\alpha$ 
when $\alpha$ and $\bar{\alpha}$ are in the same universality class. In such cases, we can then convert 
all of our final information to the same absolute scale $\Omega_\alpha$. 
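
The idea of comparing two populations at points in the search where their production probabilities are equal can be sketched in the simplest (diagonal) case, in which every model in a population shares one intrinsic probability. The code below is a minimal illustration under stated assumptions: the population sizes `N_A`, `N_B` and probabilities `P_A`, `P_B` are hypothetical, the counts are expectation values, and the matching point is computed from the known probabilities, whereas in an actual search it would have to be estimated from the measured attempts/model.

```python
import math


def expected_distinct(n_models, p, draws):
    """Expected number of distinct models found after `draws` draws,
    computed as n * (1 - (1 - p)**draws) in a numerically stable way."""
    return n_models * -math.expm1(draws * math.log1p(-p))


def new_model_probability(p, draws):
    """Chance that a draw landing in a single-probability population yields an
    as-yet-unseen model; equals 1 - M(D)/N = (1 - p)**draws in expectation."""
    return math.exp(draws * math.log1p(-p))


def matched_draws(p_a, p_b, draws_a):
    """Draw count D'' at which population b has the same new-model probability
    that population a has at D' = draws_a."""
    return draws_a * math.log1p(-p_a) / math.log1p(-p_b)


# Hypothetical populations: a is easy to reach, b is hard to reach.
# The remaining probability belongs to models in neither population.
N_A, P_A = 2_000, 1.0e-4
N_B, P_B = 8_000, 1.0e-6


def naive_ratio(draws):
    """Census ratio M_a/M_b taken at one and the same point in the search."""
    return expected_distinct(N_A, P_A, draws) / expected_distinct(N_B, P_B, draws)


def matched_ratio(draws_a):
    """Ratio M_a(D')/M_b(D'') taken at equal new-model probabilities."""
    draws_b = matched_draws(P_A, P_B, draws_a)
    return expected_distinct(N_A, P_A, draws_a) / expected_distinct(N_B, P_B, draws_b)


if __name__ == "__main__":
    print("true N_a/N_b =", N_A / N_B)
    for d in (1_000, 100_000):
        print(d, "naive:", round(naive_ratio(d), 3), "matched:", round(matched_ratio(d), 3))
```

In this toy setting the naive census ratio drifts with the depth of the search, while the matched ratio stays at the true value; the price is that the hard-to-reach population must be probed far more deeply (here by a factor of roughly 100 in draws) before the comparison can be made.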

Our results are shown in Table 1. For each listed gauge-group factor, we list the 
percentage of models containing this factor at least once (tallied across our sample 
consisting of the first 1.25 million distinct four-dimensional $\mathcal{N}=1$ heterotic string 
models we generated) as well as the percentage to which this sample result ultimately 
"floats", as extracted through Eqs. (4.7) and (4.9). Although not directly evident 
from the entries in this table, it turns out from our analysis that each of these group 
factors is in the same free-fermionic universality class, at least as far as we can 
determine numerically. Moreover, we were able to verify (again within numerical 
error) that $\alpha$ and $\bar{\alpha}$ are in the same universality class for the case when $\alpha$ represents 
the SU(5) characteristic. It was through this observation that we were able to convert 
the relative probabilities $N_\alpha/N_\beta$ into the absolute probabilities $\Omega_\alpha$ quoted in Table 1. 

As is evident from Table 1, the effects of such floating can be rather significant, 
resulting in relative percentages $\Omega_\alpha$ which often differ significantly from the percentages 
which are evident in only the finite sample set. Perhaps the most significant example 
of this can be found in the relation between the $SU_5$ and $SU_4$ columns in Table 1. 
At relatively low levels of exploration, one would easily conclude that SO(2n) groups 
(such as $SU_4 \sim SO_6$) are more common than SU(n > 3) groups, since every SO(2n) 
group has a higher probability of occurring than the corresponding SU(n) group of 
the same rank. However, when the full model space is extracted, it is clear that 
the reverse is true: the 'SU' groups actually dominate the model space even 
though they do not dominate the probability space. Indeed, the apparent paucity 
of 'SU' groups in our finite sample indicates nothing more than their difficulty of 
construction, a feature which is completely unrelated to their overall abundance 
within this class of string models. We see, then, that the issue of floating correlations 
can be rather important in any attempt to obtain statistical correlations through 
examination of only a finite data set. 

    group           finite sample     extracted $\Omega_\alpha$
    $U_1$           [illegible]       [illegible]
    $SU_2$          [illegible]       [illegible]
    $SU_3$          47.84             [illegible]
    $SU_4$          51.04             [illegible]
    $SU_5$          [illegible]       41.6
    $SU_{>5}$       1.72              [illegible]
    [illegible]     4.83              0.21
    $SO_{>10}$      2.69              0.054
    $E_{6,7,8}$     0.27              0.023

Table 1: Percentage of four-dimensional $\mathcal{N}=1$ supersymmetric heterotic string models 
containing various gauge-group factors at least once in their total gauge groups. Here 
$SU_{>5}$ indicates the appearance of any SU(n > 5) factor, while $SO_{>10}$ indicates any SO(2n) 
group with n > 6 and $E_{6,7,8}$ signifies any of the 'E' groups. For each gauge-group factor, 
the 'finite sample' column indicates the percentage of models exhibiting this factor across 
our sample of more than one million distinct models in this class. By contrast, the $\Omega_\alpha$ 
column lists the corresponding values to which these percentages would "float", as extracted 
through Eqs. (4.7) and/or (4.9). It is clear that correcting for such probability deformations 
can result in abundances which are markedly different from those which appear within a 
finite sample. 

We emphasize that although the procedures outlined in the previous section are 
fairly robust, there can be numerous numerical/computational difficulties which can 
cloud or obscure these results. For example, we found that it was much more difficult 
to extract information concerning the SU(3) gauge-group factor than for almost any 
other factor. We attribute this to the fact that the SU(3) gauge-group characteristic 
is predominantly distributed amongst models with extremely small intrinsic 
probabilities $p_i$ in this construction, making it difficult to reach significant penetration 
into this set with sufficiently large values of attempts/model. Moreover, as we have 
stressed in Eq. (4.6), the actual numbers of attempts/model, just like the actual 
numbers of models generated, are only approximations to their mathematical 
expectation values. When attempting to extract correlations between models whose $p_i$ are 
of hierarchically different sizes, these numerical issues can become severe. These 
numerical issues must therefore be dealt with on a case-by-case basis when attempting 
to extract correlations from the landscape. 
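
One way to see why a measured attempts/model is only a noisy approximation to its expectation value is the following sketch. It assumes, purely for illustration, that successive attempts are independent trials which each produce a new model with a fixed probability `p_new`, so that the number of attempts needed to find the next new model follows a geometric distribution. The function names are hypothetical and are not taken from the analysis above.

```python
import math


def attempts_mean(p_new):
    """Expected number of attempts needed to find the next new model."""
    return 1.0 / p_new


def attempts_std(p_new):
    """Standard deviation of that waiting time (geometric distribution)."""
    return math.sqrt(1.0 - p_new) / p_new


def averaged_relative_error(p_new, n_new_models):
    """Relative standard error of attempts/model when it is averaged over
    `n_new_models` consecutive new models, treating p_new as constant."""
    return math.sqrt((1.0 - p_new) / n_new_models)


if __name__ == "__main__":
    p = 0.01  # hypothetical production probability
    print("mean attempts/model:", attempts_mean(p))
    print("spread of a single measurement:", attempts_std(p))
    print("relative error over 400 new models:", averaged_relative_error(p, 400))
```

Under these assumptions a single waiting time fluctuates by nearly 100% of its mean once `p_new` is small, so attempts/model must be averaged over many new models before it becomes a usable estimate. That averaging is hardest for exactly those populations whose models are rarely produced.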

Given the results in Table 1, one might wonder why that table does not also quote joint 
probabilities for the composite Standard-Model gauge group $G_{\rm SM}$ = SU(3) × SU(2) × U(1) 
or the composite Pati-Salam gauge group $G_{\rm PS}$ = SO(6) × SO(4). The 
reason is that these composite groups $G_{\rm SM}$ and $G_{\rm PS}$ do not appear to be in the 
same universality classes as their individual factors. This, coupled with the 
numerical difficulties of dealing with apparently small $p_i$, makes an analysis for these cases 
significantly more intricate. The results for these cases will be given in Ref. [17]. 

6 Discussion 

In this paper, we have investigated some of the issues which challenge attempts to 
randomly explore the string landscape. We identified an important generic difficulty, 
the problem of "floating correlations", and presented a method for overcoming 
this difficulty which is applicable in a large variety of cases. Moreover, we found that 
properly compensating for these floating correlations can lead to statistical results 
which differ, in many cases substantially, from the results which would have emerged 
from direct statistical examination of only a partial data set. We therefore believe 
that recognition of and compensation for these effects are absolutely critical, and must 
play a role in any future string landscape study which operates through a random 
generation of string models. 

It is worth emphasizing that this entire difficulty ultimately stems from our un- 
derlying ignorance of the properties of the functions (discussed in Sect. 2) which map 
internal string-construction parameters into spacetime physical observables. If we 
had an explicit and usable representation for these functions, we could avoid this 
whole problem completely since we could analytically (or computationally) account 
for this kind of bias directly in our model-generating process. It is only because of the 
difficulty of analyzing such functions in a general way that we are forced into situa- 
tions in which our model spaces experience such significant probability deformations. 
These sorts of concerns also fail to play a role in various field-theoretic analyses of 
the landscape [15]. 

It is also worth emphasizing that although we have focused in this paper on the 
specific problem of surveying string models in a way suitable for string landscape 
studies, the mathematical problem we have been dealing with is actually far more 
general, arising in all generic situations in which we seek to scan one space (such 
as the model space) while we only have direct computational access to a second 
space (such as the probability space) whose relations to the first space are generally 
unknown or difficult to analyze analytically. Thus, we expect our approach to this 
problem to have general applicability as well. 

Despite these facts, there are still many issues which are left unresolved by our 
methods. Some of these issues are numerical and computational: for example, 
one must develop techniques of overcoming other sorts of numerical instabilities and 
fluctuations which transcend the bias issues we have been discussing, but which 
nevertheless can be significant. Other issues are more abstract and mathematical: for 
example, one must eventually develop new and efficient methods of generating and 
classifying string vacua. One also requires additional theoretical input into the 
all-important question of determining which measure is ultimately the most appropriate 
for landscape calculations. Finally, other issues are more detailed and potentially 
intractable: for example, although we have given a procedure for extracting 
statistical correlations between physical observables in the same universality class, we have 
not provided any procedure for relating physical observables in different universality 
classes. Barring successful resolution, all of these are critical issues which will likely 
hamper future statistical studies of the string landscape. 

There are also other challenges which are inherent to all attempts at statistical 
explorations of the string landscape, be they numerical or analytic, randomized or 
systematic. Although discussed elsewhere (see, e.g., Ref. (HJ), we feel that they bear 
repeating because of their generality. 

One of these has been termed the "Gödel effect": the danger that no matter 
how many conditions (or input "priors") one demands for a phenomenologically 
realistic string model, there will always be another observable for which the set of 
realistic models will make differing predictions. Therefore, such an observable will 
remain beyond our statistical ability to predict. (This is reminiscent of the "Gödel 
incompleteness theorem", which states that in any consistent axiomatic system rich 
enough to describe arithmetic, there is always a statement which, although true, 
cannot be deduced purely from the axioms.) 
Given that the full string landscape is very large, consisting of perhaps $10^{500}$ distinct 
models or more, the Gödel effect may represent a very real threat to our ability to 
ultimately extract true phenomenological predictions from the landscape. 

Another can be called the "bulls-eye" problem: the realization that since we 
cannot be certain how our low-energy world is ultimately embedded into a string 
framework, we do not always know which physical characteristics our "target" string 
models should possess. For example, we do not know whether our world becomes 
supersymmetric as we move upwards in energy, or whether strong-coupling effects develop 
which completely change our perspective on microscopic physics. We do not know 
whether our world remains essentially four-dimensional as we move upwards towards 
the string scale, or whether there exist extra spacetime dimensions (large or small, 
flat or warped) which become evident at intermediate scales. Indeed, it is possible 
that nature might pass through many layers of effective field theories at higher and 
higher energy scales before reaching an ultimate string-theory embedding. Absence 
of knowledge concerning the appropriate string-theory embedding thereby limits our 
ability to identify which statistical information about the string landscape is the most 
important to extract. 

A third challenge can be termed the "lamppost" effect — the danger of restricting 
one's attention to only those portions of the landscape where one has control over 
calculational techniques. Ultimately, barring a complete classification of all consistent 
string vacua, there is always the danger that there exists a huge sea of unexplored 
string models whose properties are sufficiently novel that they would invalidate any 
statistical conclusion we might have already reached. This danger exists regardless 
of how detailed or comprehensive an analysis we may have just performed. Indeed, 
at any moment in time, our knowledge of string theory and various constructions 
leading to consistent string models is, by necessity, quite limited. A decade ago, 
one would have considered the heterotic strings alone to have comprised the set 
of phenomenologically viable string models. The advent of the second superstring 
revolution has opened the doorway to studies of Type I strings, and recent realizations 
concerning flux vacua have led to new ideas concerning moduli stabilization. It is 
impossible to predict what the future might hold, and thus it might be argued that 
any statistical analysis of known vacua is at best premature. 

Closely related to this is the problem of unknowns — even within a given string 
construction. For example, the methods we have been describing in this paper for 
sampling string models randomly can eventually allow us to evaluate, with some 
certainty, how large a volume of the probability space of models might still have been 
missed in our search. However, although such a statistical study might be able to 
place an upper bound on the volume that such unexplored models might occupy in the 
probability space, this does not translate into any bound on the corresponding volume 
that such models might occupy in the model space. Thus, as long as such models 
have sufficiently small intrinsic probabilities $p_i$, their total number can essentially 
grow without bound and yet remain unobservable. 
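
This asymmetry between the two spaces can be made concrete with a short sketch. The bound used below is a standard confidence argument substituted here for illustration, and is not the procedure of the preceding sections: if the last `barren_draws` independent draws produced no new model, then the total probability of all still-unseen models is unlikely to exceed the value returned by `missed_volume_bound`. The function names and the numbers are hypothetical.

```python
import math


def missed_volume_bound(barren_draws, confidence=0.95):
    """Upper bound v on the probability-space volume of unseen models.
    If the unseen volume exceeded v, a run of `barren_draws` draws with no
    new model would occur with probability below 1 - confidence."""
    return -math.expm1(math.log(1.0 - confidence) / barren_draws)


def max_hidden_models(volume_bound, p_each):
    """Largest number of unseen models compatible with the volume bound,
    if each unseen model has intrinsic probability `p_each`."""
    return volume_bound / p_each


if __name__ == "__main__":
    v = missed_volume_bound(1_000_000)
    print("bound on missed probability volume:", v)
    for p in (1e-9, 1e-12, 1e-15):
        print(f"p = {p:.0e}: up to {max_hidden_models(v, p):.3g} hidden models")
```

The bound on the probability-space volume is fixed by the search, but the number of models it can hide scales as the inverse of their intrinsic probability and therefore has no upper limit.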

Despite these observations, we are not pessimistic about statistical explorations 
of the landscape. Instead, we feel that efforts to take this exploration seriously are 
important and must continue. As string phenomenologists, we cannot hope to make 
progress without ultimately coming to terms with the landscape. Given that large 
numbers of string vacua exist, it is imperative that string theorists learn about these 
vacua and the space of resulting phenomenological possibilities. As already noted 
in Ref. 0, the first step in any scientific examination of a large data set is that of 
enumeration and classification. This has been true in branches of science ranging 
from astrophysics and botany to zoology, and it is no different here. However, before 
we can undertake this monumental enterprise, we will first need to develop an entire 
toolbox of statistical techniques and algorithms which are especially constructed for 
the task at hand. It is therefore our hope that the methods developed in this paper 
will represent one small but useful tool in this toolbox. 

Appendix A 



Eq. (3.17) has been derived for the simple case in which the population matrix $N_{\alpha i}$ 
is diagonal. However, as long as $\alpha$ and $\beta$ are in the same universality class, it turns 
out that this result also holds for the more general case in which our populations $\alpha$ 
and $\beta$ are non-trivially distributed across different probabilities $p_i$. To see this, let 
us begin again with our probability condition 

$$ P^{(\alpha)}_{\rm new}(D') = P^{(\beta)}_{\rm new}(D'') . \tag{A.1} $$

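For convenience, we shall write these expressions in terms of the total number of draws 
$D$ rather than the individual counts $d_\alpha$. In general, these restricted probabilities are 
given by 

$$ P^{(\alpha)}_{\rm new}(D') = \frac{1}{\Omega_\alpha} \sum_i p_i \left[ N_{\alpha i} - M_{\alpha i}(D') \right] , \qquad
   P^{(\beta)}_{\rm new}(D'') = \frac{1}{\Omega_\beta} \sum_i p_i \left[ N_{\beta i} - M_{\beta i}(D'') \right] , \tag{A.2} $$

where $M_{\alpha i}(D)$ denotes the number of distinct $\alpha$-models already found in probability 
class $p_i$ and where $\Omega_\alpha$ and $\Omega_\beta$ are respectively the total probability-space volumes 
of the $\alpha$- and $\beta$-models: 

$$ \Omega_\alpha = \sum_j p_j N_{\alpha j} , \qquad \Omega_\beta = \sum_j p_j N_{\beta j} . \tag{A.3} $$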

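Substituting Eq. (A.2) into Eq. (A.1) then yields the condition 

$$ \sum_i p_i \left[ \frac{N_{\alpha i}}{\Omega_\alpha} - \frac{N_{\beta i}}{\Omega_\beta} \right]
 - \sum_i p_i \left[ \frac{M_{\alpha i}(D')}{\Omega_\alpha} - \frac{M_{\beta i}(D'')}{\Omega_\beta} \right] = 0 . \tag{A.4} $$

Let us now assume that $\alpha$ and $\beta$ are in the same universality class, as defined through 
Eq. (3.13); the more general definition in Eq. (3.14) can be handled through a reshuffling 
of indices in what follows. Given Eq. (3.13), let us define the ratio 

$$ \gamma \equiv \frac{N_\alpha}{N_\beta} . \tag{A.5} $$

We then trivially see that 

$$ \frac{N_{\alpha j}}{N_{\beta j}} = \gamma \quad \text{for all } j \tag{A.6} $$

and 

$$ \frac{\Omega_\alpha}{\Omega_\beta} = \frac{\sum_i p_i N_{\alpha i}}{\sum_i p_i N_{\beta i}} = \gamma , \tag{A.7} $$
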

whereupon it follows that the term in the first square brackets in Eq. (A.4) vanishes. 
Eq. (A.4) thus reduces to the condition 

$$ \sum_i p_i \left[ \frac{M_{\alpha i}(D')}{\Omega_\alpha} - \frac{M_{\beta i}(D'')}{\Omega_\beta} \right] = 0 . \tag{A.8} $$
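
The cancellation behind this reduction can be spelled out. As a sketch, assume that the universality-class relations take the form $N_{\alpha i} = \gamma N_{\beta i}$ for every $i$ together with $\Omega_\alpha = \gamma\,\Omega_\beta$, and that the first bracket of Eq. (A.4) collects the terms depending on the populations $N$. Each term in that bracket then cancels separately:

```latex
\frac{N_{\alpha i}}{\Omega_\alpha} - \frac{N_{\beta i}}{\Omega_\beta}
  = \frac{\gamma N_{\beta i}}{\gamma\,\Omega_\beta} - \frac{N_{\beta i}}{\Omega_\beta}
  = 0 \qquad \text{for all } i .
```

Only the terms involving the counts $M_{\alpha i}(D')$ and $M_{\beta i}(D'')$ of models already found survive.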



However, this condition must hold for all appropriately balanced pairs $(D', D'')$, 
since we want our results to be stable as a function of sample size. Indeed, there are 
literally an infinite number of such pairs $(D', D'')$ for which we require that Eq. (A.8) 
hold, leading to a number of distinct constraints (A.8) which is guaranteed to exceed 
the number of probability populations. (One can prove this last statement through 
induction.) Given this, there is only one possible solution: we must have 

$$ \frac{M_{\alpha i}(D')}{M_{\beta i}(D'')} = \frac{\Omega_\alpha}{\Omega_\beta} = \gamma \tag{A.9} $$

for all $i$. It then follows that 

$$ \frac{M_\alpha(D')}{M_\beta(D'')} = \frac{\sum_i M_{\alpha i}(D')}{\sum_i M_{\beta i}(D'')} = \gamma , \tag{A.10} $$


and in conjunction with Eq. (A.6) this yields Eq. (3.17), as originally claimed. Thus, 
once again, we see that we can overcome all of the biases inherent in the 
model-generation process by focusing on the probabilities for generating new distinct models, 
as expressed in Eqs. (3.15) or (A.2), and by comparing the numbers of models which 
have emerged at different points in the model-generation process at which these 
respective production probabilities are equal. 